This document details visualization in anvi'o.
Anvi'o is run in a dedicated conda environment.
conda activate anvio-7.1
Get the bin information into a format that anvi'o can use. This means concatenating the bin files for each binning method into a single tab-separated file (one line per contig or read, no header) that lists which contig or read belongs to which bin.
library(stringr)  # provides str_replace()

# get all bin directories; list.dirs() is recursive, so path[1] is "../data/Bins" itself
path <- list.dirs("../data/Bins")
# loop over the binning-method subdirectories (elements 2 to 7)
for (i in 2:7) {
  DF <- NULL
  pathname <- path[i]
  filelist <- list.files(pathname)
  # get list of all contigs and reads for all bins into 1 tsv file
  for (filename in filelist) {
    df <- read.csv(file.path(pathname, filename), header = FALSE)
    colnames(df) <- "read"
    # bin name = file name with its first "." replaced by "_"
    df$bin <- str_replace(filename, "[.]", "_")
    # change names if a number is at the beginning of the bin name
    if (basename(pathname) == "24_sample_bam_bins") {
      df$bin <- str_replace(df$bin, "24", "twentyfour")
    }
    if (basename(pathname) == "47_sample_bam_bins") {
      df$bin <- str_replace(df$bin, "47", "fortyseven")
    }
    DF <- rbind(DF, df)
  }
  # write one header-less, tab-separated file per binning method
  write.table(DF, paste0("../output/all_bins/", basename(pathname), ".tsv"),
              row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")
}
Examine the resulting TSV files. They have no header row, so read them with header = FALSE.
library(knitr)  # provides kable()
tsv_output <- read.csv("../output/all_bins/assembly_bins.tsv", sep = "\t", header = FALSE, col.names = c("read", "bin"))
kable(head(tsv_output, 6))
| read | bin |
|---|---|
| MG1058_s821.ctg000852l | assembly_bin_1 |
| MG1058_s1105.ctg001148l | assembly_bin_1 |
| MG1058_s1585.ctg001645l | assembly_bin_1 |
| MG1058_s1820.ctg001893l | assembly_bin_1 |
| MG1058_s645.ctg000674l | assembly_bin_10 |
| MG1058_s914.ctg000951l | assembly_bin_10 |
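Before importing, it can help to sanity-check each file. The sketch below is a hypothetical helper (`check_collection` is not part of anvi'o) that assumes the two-column, tab-separated, header-less layout written above. The bin-name rule it applies (letters, digits, and underscores, not starting with a digit) is an assumption inferred from the renaming done in this document, not a statement of anvi'o's exact validation.

```shell
# Hypothetical helper: sanity-check a collection file before importing it.
# Assumes two tab-separated columns (item name, bin name) and no header.
check_collection() {
    awk -F'\t' '
        # Every line must have exactly two fields.
        NF != 2 {
            printf "line %d: expected 2 tab-separated fields, found %d\n", NR, NF
            bad = 1
            next
        }
        # Flag bin names that start with a digit or contain characters
        # other than letters, digits, and underscores (assumed rule).
        $2 !~ /^[A-Za-z_][A-Za-z0-9_]*$/ {
            printf "line %d: bin name needs renaming: %s\n", NR, $2
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Usage: check_collection ../output/all_bins/assembly_bins.tsv
```

The function prints one message per problem line and returns a non-zero status if anything was flagged, so it can gate the import step in a script.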
Import the bins as a collection into the anvi'o profile database that was already created.
# Example for one collection import; change the input TSV and the -C collection name for each binning method
anvi-import-collection "./github/jordan-marinimicrobia/output/all_bins/short_reads_bam_bins.tsv" \
  -p "./Downloads/plus_PROFILE.db" \
  -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_plus/1058_P1_2018_585_0.2um_assembly_plus.db" \
  --contigs-mode -C shortreads
# open the interactive interface on the same profile and contigs databases
anvi-interactive -p "./Downloads/plus_PROFILE.db" \
  -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_plus/1058_P1_2018_585_0.2um_assembly_plus.db"
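Since the import has to be repeated for every binning method, one option is to generate the commands rather than edit them by hand. This is only a sketch: `print_import_commands` is a hypothetical helper, the database paths in the usage line are placeholders, and it uses only the flags already shown above (`-p`, `-c`, `--contigs-mode`, `-C`). It prints the commands instead of running them so they can be reviewed first.

```shell
# Hypothetical helper: print one anvi-import-collection command per bin file.
# Arguments: profile db, contigs db, then any number of FILE=COLLECTION pairs.
print_import_commands() {
    profile=$1
    contigs=$2
    shift 2
    for pair in "$@"; do
        tsv=${pair%%=*}          # text before the first "="
        collection=${pair#*=}    # text after the first "="
        printf 'anvi-import-collection "%s" -p "%s" -c "%s" --contigs-mode -C %s\n' \
            "$tsv" "$profile" "$contigs" "$collection"
    done
}

# Usage (placeholder paths):
# print_import_commands PROFILE.db CONTIGS.db short_reads_bam_bins.tsv=shortreads
```

Collection names other than `shortreads` are not given in this document, so each pair has to be supplied explicitly.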
Example of what the interactive interface looks like with bins loaded. Anvi'o calculates statistics such as completion and redundancy for each bin.
Anvio interactive browser
Use anvi-refine to dig into “contaminated” bins and see how and why they are contaminated. As a reminder, in the anvi'o collections “.” in each bin name was changed to “_”, and the leading 24 and 47 were written out as twentyfour and fortyseven.
anvi-refine -p "./Downloads/assembly_PROFILE.db" -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_only/1058_P1_2018_585_0.2um_assembly.db" -C shortreads -b short_reads_bam_bin_163
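To find the name to pass to `-b`, the renaming described above can be mirrored in a small shell function. This is a sketch under assumptions: `anvio_bin_name` is a hypothetical helper, and the original file names in the usage lines (for example `short_reads_bam_bin.163`) are guesses at what the bin files look like. It applies the same two steps as the R loop: replace the first "." with "_", then spell out 24 or 47 for the two sample-based methods.

```shell
# Hypothetical helper: map an original bin file name to the name stored in
# the anvi'o collection, mirroring the renaming done in the R loop.
anvio_bin_name() {
    # $1 = original bin file name, $2 = binning-method directory name
    name=$(printf '%s\n' "$1" | sed 's/[.]/_/')   # first "." becomes "_"
    case $2 in
        24_sample_bam_bins) name=$(printf '%s\n' "$name" | sed 's/24/twentyfour/') ;;
        47_sample_bam_bins) name=$(printf '%s\n' "$name" | sed 's/47/fortyseven/') ;;
    esac
    printf '%s\n' "$name"
}

# Usage (file names are illustrative):
# anvio_bin_name short_reads_bam_bin.163 short_reads_bam_bins
# anvio_bin_name 24_sample_bam_bin.7 24_sample_bam_bins
```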
Example of a contaminated bin. The coverage is not consistent across the bin, the clustering dendrogram that anvi'o uses to group sequences has many branches, and many single-copy core genes are duplicated.
Anvio interactive display of a contaminated bin
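I used anvi-interactive to get a summary of all bins in each bin collection. The example output file below reports, for each bin, its total length, number of contigs, N50, GC content, percent completion, percent redundancy, and taxonomy.
# named bin_summary to avoid masking the base R summary() function
bin_summary <- read.table("../output/anvio_outputs/assembly_plus_summary.txt", sep = "\t", header = TRUE)
summary(bin_summary)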
## bins total_length num_contigs N50
## Length:274 Min. : 202581 Min. : 1.00 Min. : 10203
## Class :character 1st Qu.: 310692 1st Qu.: 9.00 1st Qu.: 12386
## Mode :character Median : 488296 Median : 23.00 Median : 14620
## Mean : 879731 Mean : 52.00 Mean : 74294
## 3rd Qu.: 896170 3rd Qu.: 50.75 3rd Qu.: 50200
## Max. :20079188 Max. :1582.00 Max. :3034959
## GC_content percent_completion percent_redundancy t_domain
## Min. :26.47 Min. : 0.00 Min. : 0.00 Length:274
## 1st Qu.:38.99 1st Qu.: 0.00 1st Qu.: 0.00 Class :character
## Median :45.84 Median : 0.00 Median : 0.00 Mode :character
## Mean :47.44 Mean : 13.16 Mean : 16.19
## 3rd Qu.:57.02 3rd Qu.: 23.59 3rd Qu.: 0.00
## Max. :69.50 Max. :100.00 Max. :2053.52
## t_phylum t_class t_order t_family
## Length:274 Length:274 Length:274 Length:274
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## t_genus t_species
## Length:274 Length:274
## Class :character Class :character
## Mode :character Mode :character
##
##
##
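The same summary file can also be screened from the shell, for example to list bins above a completion cutoff and below a redundancy cutoff. The sketch below is a hypothetical helper (`list_bins`); it assumes a tab-separated file with a header row containing the `bins`, `percent_completion`, and `percent_redundancy` columns shown above, and the thresholds in the usage line are arbitrary illustrations rather than values used in this analysis.

```shell
# Hypothetical helper: list bins that meet completion/redundancy cutoffs.
# $1 = summary file, $2 = minimum completion (%), $3 = maximum redundancy (%)
list_bins() {
    awk -F'\t' -v min_c="$2" -v max_r="$3" '
        NR == 1 {
            # Map header names to column positions.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("bins" in col) || !("percent_completion" in col) || !("percent_redundancy" in col)) {
                print "missing expected column in header" > "/dev/stderr"
                exit 2
            }
            next
        }
        ($col["percent_completion"] + 0 >= min_c + 0) &&
        ($col["percent_redundancy"] + 0 <= max_r + 0) {
            print $col["bins"]
        }
    ' "$1"
}

# Usage (illustrative thresholds):
# list_bins ../output/anvio_outputs/assembly_plus_summary.txt 70 10
```

Looking columns up by header name keeps the filter working if the column order in the summary file changes.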
Create anvi'o contigs databases for my bins and annotate them with HMMs, single-copy core gene taxonomy, tRNAs, COGs, and KEGG KOfams. Then build the pangenome and use the interactive display to visualize it and to find variable regions of the Sulfitobacter genome, which I explore further in the next section.
anvi-gen-contigs-database -f sulf_genomes/assembly_plus_bin_4.fa -o sulf_genomes/dbs/sulfbin4.db
anvi-run-hmms -c sulf_genomes/dbs/sulfbin4.db
anvi-run-scg-taxonomy -c sulf_genomes/dbs/sulfbin4.db
anvi-scan-trnas -c sulf_genomes/dbs/sulfbin4.db
anvi-run-ncbi-cogs -c sulf_genomes/dbs/sulfbin4.db
anvi-run-kegg-kofams -c sulf_genomes/dbs/sulfbin4.db
anvi-gen-genomes-storage -e sulf-external-genomes.txt \
-o sulf-GENOMES.db
anvi-pan-genome -g sulf-GENOMES.db -n sulfitobacter
anvi-display-pan -g sulf-GENOMES.db -p sulfitobacter/sulfitobacter-PAN.db
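The file passed to `-e` is a tab-separated table with a `name` column and a `contigs_db_path` column, one row per genome. A minimal sketch for building it from a directory of contigs databases follows; `make_external_genomes` is a hypothetical helper, and using each file's base name as the genome name is an assumption (names should be simple, such as `sulfbin4`).

```shell
# Hypothetical helper: write an external-genomes file for a directory of
# contigs databases. $1 = directory with *.db files, $2 = output file.
make_external_genomes() {
    printf 'name\tcontigs_db_path\n' > "$2"
    for db in "$1"/*.db; do
        [ -e "$db" ] || continue          # skip if the glob matched nothing
        name=$(basename "$db" .db)        # e.g. sulfbin4.db becomes sulfbin4
        printf '%s\t%s\n' "$name" "$db" >> "$2"
    done
}

# Usage: make_external_genomes sulf_genomes/dbs sulf-external-genomes.txt
```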
Pangenome visualization.
Sulfitobacter pangenome